Week 13
Advanced Applications and Best Practices

SSPS4102 Data Analytics in the Social Sciences
SSPS6006 Data Analytics for Social Research


Semester 1, 2026
Last updated: 2026-01-23

Francesco Bailo

Acknowledgement of Country

I would like to acknowledge the Traditional Owners of Australia and recognise their continuing connection to land, water and culture. The University of Sydney is located on the land of the Gadigal people of the Eora Nation. I pay my respects to their Elders, past and present.

Learning Objectives

By the end of this lecture, you will be able to:

  • Share and document data properly and ethically
  • Understand the basics of cross-validation for model evaluation
  • Apply best practices in statistical workflow
  • Create fully reproducible research projects
  • Communicate findings effectively to various audiences

Storing and Sharing Data

Why Share Data?

The Case for Data Sharing

  • A reluctance to share data is associated with research papers that had weaker evidence and more potential errors (Wicherts, Bakker, and Molenaar 2011)
  • Sharing data enhances credibility by enabling verification
  • Shared data can generate new knowledge as others use it to answer different questions

“If we can get our dataset off our own computer, then we are much of the way there.”

— Alexander (2023), TSwD

The FAIR Principles

The FAIR principles guide data sharing and management:

  1. Findable: One unchanging identifier (DOI) with high-quality descriptions
  2. Accessible: Standardised retrieval approaches that are open and free
  3. Interoperable: Broadly-applicable language and vocabulary
  4. Reusable: Extensive descriptions with clear usage conditions

Sharing via GitHub

The easiest way to start sharing data is through GitHub:

# Reading data directly from a GitHub repository
library(readr)

data_location <- paste0(
  "https://raw.githubusercontent.com/",
  "username/repository/main/data/dataset.csv"
)

my_data <- read_csv(file = data_location)

Benefits of GitHub for Data

  • Already integrated into your workflow
  • Stores raw data, cleaned data, AND transformation scripts
  • Meets the “bronze” standard of reproducibility

Creating R Packages for Data

R packages can be used to share datasets with documentation:

# Install a data package from GitHub
devtools::install_github("username/favcolordata")

# Load and use the data
library(favcolordata)
head(color_data)

Advantages of Data Packages

  • Documentation travels with the data
  • Clear data dictionary built in
  • Easy installation and loading
  • Examples: babynames, troopdata

Data Repositories

For more formal sharing, deposit your data in a repository:

Repository                    Features
Zenodo                        Free, operated by CERN, provides DOI
OSF (Open Science Framework)  Free, integrates with GitHub
Harvard Dataverse             Common for journal publications
Australian Data Archive       Australian-specific repository

Why Use Repositories?

  • Provides a persistent DOI for citation
  • Offloads responsibility for hosting
  • Establishes a single point of truth
  • Makes access independent of original researchers

Data Documentation: Datasheets

Data Dictionary = List of ingredients

  • Variable names
  • Descriptions
  • Data types
  • Sources

Datasheet = Nutrition label

  • Who created it?
  • Who funded it?
  • How complete is it?
  • What are the limitations?

Datasheets for Datasets (Gebru et al. 2021)

Just as electronics come with datasheets, datasets should come with documentation that enables users to understand what they’re working with.

Protecting Privacy

Personally Identifying Information (PII)

PII enables linking observations to actual people:

  • Email addresses, names, home addresses
  • Combinations of variables (age + location + education)
  • Even seemingly innocuous variables at extremes

Protection methods:

  1. Hashing: One-way transformation (e.g., MD5)
  2. Simulation: Release synthetic data that preserves statistical properties
  3. Differential privacy: Mathematical guarantees of privacy
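As a minimal sketch of the first method, a one-way hash can be produced in base R with `tools::md5sum()`. It hashes files rather than strings, so each value is written to a temporary file first (in practice a package such as digest hashes strings directly); the email addresses here are invented:

```r
# One-way hashing sketch using base R (tools::md5sum hashes files,
# so we write each value to a temporary file first)
hash_value <- function(x) {
  f <- tempfile()
  writeLines(x, f)
  unname(tools::md5sum(f))
}

emails <- c("ada@example.com", "bob@example.com")  # made-up PII
hashed <- vapply(emails, hash_value, character(1))

# Hashing is deterministic: the same email always yields the same
# hash, so hashed IDs can still link records across datasets
identical(hash_value("ada@example.com"), hashed[[1]])
```

Because the transformation is one-way, the original emails cannot be recovered from the hashes, but identical values still match, which preserves the ability to join datasets.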

File Formats: CSV vs Parquet

library(arrow)

# Write to parquet
write_parquet(my_data, "data/analysis_data.parquet")

# Read from parquet
my_data <- read_parquet("data/analysis_data.parquet")

Aspect          CSV     Parquet
File size       Larger  Smaller (3-4x)
Speed           Slower  Faster
Data types      Lost    Preserved
Human readable  Yes     No

Cross-Validation

Why Cross-Validation?

The Problem with Using Training Data

When we evaluate predictions on the same data used to fit the model, predictions are optimistically biased for assessing generalisation.

Cross-validation addresses this by:

  • Using part of the data to fit the model
  • Using the rest (the hold-out set) as a proxy for future data
  • Assessing how well the model generalises

Leave-One-Out Cross-Validation (LOO)

library(rstanarm)

# Fit model to all data
fit_all <- stan_glm(y ~ x, data = fake)

# Fit model excluding observation 18
fit_minus_18 <- stan_glm(y ~ x, data = fake[-18, ])

How LOO Works

  1. Remove one observation from the data
  2. Fit the model to the remaining data
  3. Predict the held-out observation
  4. Repeat for all observations
  5. Average the prediction errors
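The five steps above can be sketched in base R. For speed this uses `lm()` rather than the lecture's `stan_glm()`, and the `fake` data are simulated for illustration:

```r
# Manual leave-one-out cross-validation with lm()
set.seed(853)
n <- 50
fake <- data.frame(x = rnorm(n))
fake$y <- 2 + 0.5 * fake$x + rnorm(n)

loo_errors <- numeric(n)
for (i in seq_len(n)) {
  fit_i  <- lm(y ~ x, data = fake[-i, ])          # fit without observation i
  pred_i <- predict(fit_i, newdata = fake[i, ])   # predict the held-out point
  loo_errors[i] <- fake$y[i] - pred_i             # held-out residual
}

mean(loo_errors^2)  # cross-validated mean squared error
</code>
```

In practice the loo package computes an approximation to this without refitting the model n times, but the loop makes the logic explicit.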

LOO Residuals vs Regular Residuals

Regular residuals

  • Observation minus predicted value
  • From model fit to ALL data
  • Tend to be smaller (optimistic)

LOO residuals

  • Observation minus predicted value
  • From model fit WITHOUT that observation
  • More honest assessment

LOO R²

Just as we have R², we can calculate LOO R² using LOO residuals. This gives a more realistic estimate of explained variance for new data.
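As a base-R sketch of this idea, LOO R² can be computed from the held-out residuals: one minus the ratio of the cross-validated error sum of squares to the total sum of squares. The data are simulated and `lm()` stands in for the Bayesian model (where `loo()` handles this for you):

```r
# LOO R² from held-out residuals, compared with in-sample R²
set.seed(853)
n <- 50
x <- rnorm(n)
y <- 2 + 0.5 * x + rnorm(n)
d <- data.frame(x, y)

loo_resid <- vapply(seq_len(n), function(i) {
  fit <- lm(y ~ x, data = d[-i, ])
  y[i] - predict(fit, newdata = d[i, ])
}, numeric(1))

in_sample_R2 <- summary(lm(y ~ x, data = d))$r.squared
loo_R2 <- 1 - sum(loo_resid^2) / sum((y - mean(y))^2)
c(in_sample = in_sample_R2, loo = loo_R2)  # LOO R² is a bit lower
```

For linear regression the held-out residuals are always at least as large as the in-sample residuals, so LOO R² never exceeds in-sample R².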

Using the loo Package

library(loo)

# Fit model
fit <- stan_glm(kid_score ~ mom_hs + mom_iq, data = kidiq)

# Compute LOO cross-validation
loo_result <- loo(fit)
print(loo_result)

Output includes:

  • elpd_loo: Expected log predictive density (higher = better)
  • p_loo: Effective number of parameters
  • looic: LOO information criterion (-2 × elpd_loo)

Comparing Models with LOO

# Fit two models
fit_1 <- stan_glm(kid_score ~ mom_hs, data = kidiq)
fit_2 <- stan_glm(kid_score ~ mom_hs + mom_iq, data = kidiq)

# Compare
loo_compare(loo(fit_1), loo(fit_2))

Interpreting Model Comparison

  • A difference in elpd_loo larger than 4 is meaningful
  • Standard errors help assess uncertainty
  • Consider the trade-off between fit and complexity

K-Fold Cross-Validation

When LOO is unstable (warning messages), use K-fold:

# 10-fold cross-validation
kfold_result <- kfold(fit, K = 10)
print(kfold_result)

How K-Fold Works

  1. Randomly partition data into K subsets (typically K = 10)
  2. For each fold: fit model on K-1 subsets, predict the held-out subset
  3. Average the prediction errors across all folds
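The three steps above can be sketched in base R with simulated data, using `lm()` for speed (the lecture's `kfold()` does the equivalent for Bayesian models):

```r
# Manual 10-fold cross-validation with lm()
set.seed(853)
n <- 100
d <- data.frame(x = rnorm(n))
d$y <- 2 + 0.5 * d$x + rnorm(n)

K <- 10
fold <- sample(rep(seq_len(K), length.out = n))  # random partition into K folds

fold_mse <- vapply(seq_len(K), function(k) {
  fit  <- lm(y ~ x, data = d[fold != k, ])        # fit on K - 1 folds
  pred <- predict(fit, newdata = d[fold == k, ])  # predict the held-out fold
  mean((d$y[fold == k] - pred)^2)
}, numeric(1))

mean(fold_mse)  # cross-validated mean squared error
```

With K = 10 each model is fit only ten times, which is why K-fold is the fallback when the LOO approximation is unstable.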

Overfitting and Noise Predictors

# Copy the kidiq data and add 5 pure-noise predictors
kidiqr <- kidiq
n <- nrow(kidiqr)
kidiqr$noise <- array(rnorm(5 * n), c(n, 5))
fit_noise <- stan_glm(kid_score ~ mom_hs + mom_iq + noise,
                      data = kidiqr)

What Happens with Noise Predictors?

Metric         Original  With Noise
R²             0.21      0.22 ↑
LOO R²         0.20      0.19 ↓
Log score      -1872     -1871 ↑
LOO log score  -1876     -1880 ↓

Adding noise improves in-sample fit but hurts cross-validated performance!

10 Quick Tips for Regression

Tip 1: Think About Variation and Replication

Variation is Central

  • Fitting the same model to different datasets reveals variation across problems
  • Replication means performing ALL steps from the start, not just increasing sample size
  • A fresh perspective helps avoid “forking paths” in analysis

In observational sciences like economics and political science, replication can be more indirect—for example, analysing local economic activity within different countries.

Tip 2: Forget About Statistical Significance

Three Reasons to Move Beyond p-values

  1. Discretising based on significance tests throws away information

  2. In real problems, there are no true zeroes—everything that could have an effect does have some effect

  3. Comparisons and effects vary by context, so excluding zero isn’t particularly informative

Focus instead on:

  • Effect sizes and their uncertainty
  • Practical significance
  • Variation across contexts

Tip 3: Graph the Relevant, Not the Irrelevant

Do graph:

  • The fitted model itself
  • Regression lines/curves
  • Model predictions
  • Multiple visualisations

Don’t obsess over:

  • Q-Q plots (unless you’ll act on them)
  • Influence diagrams
  • Every diagnostic from the package

Rule of Thumb

Any graph you show, be prepared to explain. If you can’t explain why it matters, don’t include it.

Tip 4: Interpret Coefficients as Comparisons

Coefficients Are Not “Effects”

A regression coefficient is the modelled average difference in the outcome, comparing two individuals that differ in one predictor while being at the same levels of all other predictors.

Benefits of this framing:

  • Always available as a data description (no causal assumptions needed)
  • Helps understand complex models as built from simpler comparisons
  • Works for causal inference as a special case
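The comparison framing can be checked directly: the coefficient on a predictor equals the difference in fitted values between two observations that differ by one unit in that predictor, holding the others fixed. A small sketch with simulated data:

```r
# A coefficient as a modelled comparison, not an "effect"
set.seed(853)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 3 * x1 + 2 * x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)

# Two hypothetical observations: differ by 1 in x1, same x2
pred_a <- predict(fit, newdata = data.frame(x1 = 1, x2 = 0))
pred_b <- predict(fit, newdata = data.frame(x1 = 0, x2 = 0))

pred_a - pred_b  # equals coef(fit)["x1"] exactly
```

No causal assumption is needed for this statement; it is a description of the fitted model.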

Tip 5: Use Fake-Data Simulation

# Simulate data to understand your methods
library(tibble)

set.seed(853)
n <- 100
x <- rnorm(n)
y <- 2 + 0.5*x + rnorm(n, sd = 1)
fake <- tibble(x, y)

# Fit model and check if you recover true parameters
fit <- lm(y ~ x, data = fake)
summary(fit)

Payoffs of Simulation

  1. Forces you to think about realistic parameter values
  2. Lets you study properties of methods under repeated sampling
  3. Helps debug your code—if you can’t recover true parameters, something’s wrong

Tip 6: Fit Many Models

Start Simple, Build Complexity

  • Begin with too-simple models
  • Build up to more complex models
  • Keep track of what you’ve tried
  • Report results from ALL relevant models

“It’s rarely a good idea to run the computer overnight fitting a single model. At least, wait until you’ve developed some understanding by fitting many models.”

Tip 7: Set Up a Computational Workflow

Practical strategies:

  • Data subsetting: Break large datasets into subsets, analyse separately, then combine
  • Efficient computation: Use faster approximations while exploring
  • Fake-data simulation: Debug code before running on real data

Advantages of Subsetting

  • Faster computation allows more model exploration
  • Can reveal interesting variation (e.g., regional differences)
  • Move to multilevel models later for efficiency

Tip 8: Use Transformations

Consider transforming variables:

  • Logarithms for all-positive variables (creates multiplicative models)
  • Standardising based on scale or potential range (for interpretability)
  • Interactions and combined predictors

# Log transformation for positively skewed data
model <- lm(log(income) ~ education + experience, data = survey)

# Standardised coefficients for comparison
model_std <- lm(scale(outcome) ~ scale(pred1) + scale(pred2), 
                data = df)

Tip 9: Do Causal Inference Carefully

Don’t Assume Causal Interpretation

  • A regression coefficient is NOT automatically a causal effect
  • If interested in causality, carefully define your treatment variable
  • Address balance and overlap between treated and control units
  • Consider treatment effect heterogeneity

Don’t set up one large regression to answer multiple causal questions at once—this is rarely appropriate in observational settings.

Tip 10: Learn Methods Through Live Examples

Apply Methods to Problems You Care About

  • Gather data on questions that interest you
  • Develop understanding through simulation and graphing
  • Know your data, measurements, and data-collection procedure
  • Understand magnitudes, not just signs

“You will need this understanding to interpret your findings and catch things that go wrong.”

Writing Research (Review)

The Process of Writing

Key Principles (from Week 8)

  1. Write for the reader, not yourself
  2. Get to a first draft as quickly as possible—even if it’s horrible
  3. Rewriting is the essence of writing
  4. Brevity matters—remove unnecessary words

“The process of writing is a process of rewriting. The critical task is to get to a first draft as quickly as possible.”

— Alexander (2023), TSwD

Paper Structure Reminder

Section       Purpose
Title         Tell your story in one line
Abstract      3-5 sentences covering context, methods, findings, implications
Introduction  Self-contained overview—give away the punchline
Data          Create a “sense of place” for your data
Model         Specify and justify your approach
Results       What you found (not what it means)
Discussion    Implications, limitations, future work

Multilevel Regression with Post-Stratification (MRP)

What is MRP?

MRP in One Sentence

MRP uses a regression model to relate survey responses to characteristics, then rebuilds the sample to better match the population.

Why use MRP?

  • Allows “re-weighting” with proper uncertainty
  • Enables use of non-probability samples
  • Can speak to subgroups that may not be well-represented

The MRP Workflow

  1. Gather survey data thinking about what’s needed for post-stratification
  1. Gather post-stratification data (census or representative sample)
  1. Model the outcome using predictors available in BOTH datasets
  1. Apply the model to the post-stratification data and aggregate
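Step 4 can be sketched in base R with a toy post-stratification table. All the cell counts and predicted values below are invented for illustration; in a real MRP analysis the predictions would come from a fitted multilevel model and the counts from a census:

```r
# Toy post-stratification: weight cell-level predictions by
# population cell counts (all numbers invented)
poststrat <- data.frame(
  age_group = c("18-29", "30-59", "60+"),
  N         = c(2000, 5000, 3000),   # census counts per cell
  pred      = c(0.60, 0.50, 0.40)    # model-predicted support per cell
)

# MRP estimate: population-weighted average of cell predictions
weighted.mean(poststrat$pred, poststrat$N)  # 0.49
```

This is how a sample that over-represents some cells (such as the Xbox respondents) can still produce a population-level estimate: each cell's prediction counts in proportion to its share of the population, not its share of the sample.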

Example: Xbox Survey (Wang et al. 2015)

A Famous MRP Example

  • 750,000+ interviews through Xbox gaming platform
  • 93% male, 65% aged 18-29 (vs 47% and 19% in population)
  • Used MRP to adjust this wildly unrepresentative sample
  • Successfully forecast the 2012 US Presidential Election

MRP is not magic—the laws of statistics still apply—but it can make biased data useful when applied carefully.

Concluding Remarks

What We’ve Covered This Semester

Weeks 1-5: Foundations

  • R and RStudio setup
  • Reproducible workflows
  • Data acquisition
  • Visualisation
  • Data cleaning
  • Probability simulation

Weeks 6-13: Modelling

  • Linear regression
  • Multiple regression
  • Model diagnostics
  • Logistic regression
  • Count models
  • Surveys and experiments
  • Causal inference
  • Best practices

The Central Thesis

What Makes Good Data Science?

“We consider data science to be the process of developing and applying a principled, tested, reproducible, end-to-end workflow that focuses on quantitative measures in and of themselves, and as a foundation to explore questions.”

— Alexander (2023), TSwD

  • Mathematical rigour = theorems with proofs
  • Data science rigour = claims with verified, tested, reproducible code and data

Outstanding Issues in Data Science

  1. How do we write effective tests? Testing in data science is still developing
  2. What happens at data cleaning? Hidden decisions have huge effects on results
  3. How do we create effective names? Standardised nomenclature is needed
  4. How do we teach data science? The field is still learning how to scale education
  5. What’s the relationship between industry and academia? Innovation happens in both

Next Steps for Your Learning

To Solidify Foundations:

  • R for Data Science (Wickham et al.)
  • Data Science: A First Introduction (Timbers et al.)
  • SQL and Python basics

To Go Deeper:

  • Statistical Rethinking (McElreath)
  • Regression and Other Stories (Gelman et al.)
  • Causal Inference: The Mixtape (Cunningham)

The Most Important Advice

Write code every day. The only way to get better at data analysis is to do data analysis.

Final Thoughts

“May You Live in Interesting Times”

“Data science needs to insist on diversity, both in terms of approaches and applications. It is increasingly the most important work in the world, and hegemonic approaches have no place.”

— Alexander (2023), TSwD

  • Take courses on fundamentals, not just fashionable applications
  • Read core texts, not just what’s trending
  • Be at the intersection of different areas

Thank You!

Key Takeaways from This Course

  1. Reproducibility is fundamental—code and data should be shareable
  2. Workflows matter—Plan → Simulate → Acquire → Explore → Share
  3. Models are tools—understand their assumptions and limitations
  4. Writing is essential—you tell stories with data through writing
  5. Keep learning—the field is evolving rapidly

Good luck with your assessments and future data science endeavours!

References

  • Alexander, R. (2023). Telling Stories with Data. CRC Press.
  • Gebru, T., et al. (2021). Datasheets for datasets. Communications of the ACM, 64(12), 86-92.
  • Gelman, A., Hill, J., & Vehtari, A. (2020). Regression and Other Stories. Cambridge University Press.
  • Wang, W., et al. (2015). Forecasting elections with non-representative polls. International Journal of Forecasting, 31(3), 980-991.
  • Wicherts, J. M., Bakker, M., & Molenaar, D. (2011). Willingness to share research data is related to the strength of the evidence and the quality of reporting of statistical results. PLoS ONE, 6(11), e26828.
  • Wilkinson, M. D., et al. (2016). The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data, 3, 160018.